Method and System for Assessment of Regulatory Variants in a Genome

ABSTRACT

The present invention provides methods embodied in a system that can be applied to genetic information comprising an individual genome to assess the regulatory impact of specific genetic variants and their possible impact on biological function or disease pathology.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 61/526,242 filed Aug. 22, 2012, which is hereby incorporated by reference in its entirety for all purposes.

This application claims priority to U.S. Provisional Application No. 61/526,095 filed Aug. 22, 2012, which is hereby incorporated by reference in its entirety for all purposes.

GOVERNMENT RIGHTS

This invention was made with Government support under contract HG000237 awarded by the National Institutes of Health. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

Genome-wide association studies have discovered many genetic loci associated with disease, but the molecular basis of these associations is often unresolved. Genome-wide regulatory and gene expression profiles measured across individuals and diseases reflect downstream effects of genetic variation, and may allow for functional annotation of disease-associated loci.

Complete genome sequences of individual patients will soon become integrated as part of routine clinical care. There exists a need for interpreting the clinical significance of novel genetic variants presenting in a patient's personal genome, which are known to be associated with diverse clinical disorders. Current tools and approaches for clinical assessment of genetic variation do not explicitly consider gene regulatory information and are typically focused on specific gene coding regions.

SUMMARY OF THE INVENTION

Embodiments of the present invention enable genome-wide systematic evaluation of potentially clinically relevant genetic variation in a personal genome. Among other things, the present invention provides methods embodied in a system that can be applied to genetic information comprising an individual genome to assess the regulatory impact of specific genetic variants and their possible impact on biological function or disease pathology.

An embodiment of the invention is comprised of databases and algorithms embodied in software, where the databases contain information providing genome-wide quantitative and genetic profiles of transcription factor binding measured across a multitude of individual human genomes, information of genetic variants associated with disease conditions, information on DNA motifs associated with transcription factor binding, as well as molecular profiles of disease pathology.

Embodiments of the present invention use genome-wide quantitative gene regulatory information to assess total genetic variation presenting in an individual genome. The present invention provides the ability to infer transcription factor binding events from genotypes. Also, the present invention provides the ability to associate individual variation in gene regulatory regions with biological function and disease pathology.

Applications of the present invention include clinical assessment of personal genomes, clinical assessment of cancer genomes, interpretation of genetic disease associations, and discovery of regulatory DNA biomarkers. In other embodiments, additional types of gene regulatory information can be added.

These and other embodiments can be more fully appreciated upon an understanding of the detailed description of the invention as disclosed below in conjunction with the attached figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings will be used to more fully describe embodiments of the present invention.

FIGS. 1A and 1B are block diagrams of computer systems on which embodiments of the present invention can be practiced.

FIG. 2 is a block diagram of a method according to an embodiment of the invention.

FIG. 3 is a block diagram of a method according to an embodiment of the invention.

FIG. 4 is a graph illustrating the variability of transcription factor binding in the analysis of regulatory variation.

FIG. 5 is an illustration of the manner in which a single SNP is associated with NFkB binding.

FIG. 6 is an illustration of the manner in which the EBF1 motif affects NFkB binding.

FIG. 7 is a graph demonstrating the manner in which disease-associated SNPs in NFkB binding regions are more pleiotropic.

FIG. 8 is a graph illustrating the effects of transcription factor binding and disease SNPs.

FIG. 9 is a graph illustrating the manner in which SNPs associated with inflammatory and autoimmune diseases were overrepresented in NFκB binding regions.

DETAILED DESCRIPTION

Among other things, the present invention relates to methods, techniques, and algorithms that are intended to be implemented in digital computer system 100 such as generally shown in FIG. 1A. Such a digital computer or embedded device is well-known in the art and may include the following.

Computer system 100 may include at least one central processing unit 102 but may include many processors or processing cores. Computer system 100 may further include memory 104 in different forms such as RAM, ROM, hard disk, optical drives, and removable drives that may further include drive controllers and other hardware. Auxiliary storage 112 may also be included that can be similar to memory 104 but may be more remotely incorporated such as in a distributed computer system with distributed memory capabilities.

Computer system 100 may further include at least one output device 108 such as a display unit, video hardware, or other peripherals (e.g., printer). At least one input device 106 may also be included in computer system 100 that may include a pointing device (e.g., mouse), a text input device (e.g., keyboard), or touch screen.

Communications interfaces 114 also form an important aspect of computer system 100 especially where computer system 100 is deployed as a distributed computer system. Communications interfaces 114 may include LAN network adapters, WAN network adapters, wireless interfaces, Bluetooth interfaces, modems and other networking interfaces as currently available and as may be developed in the future.

Computer system 100 may further include other components 116 that may be generally available components as well as specially developed components for implementation of the present invention. Importantly, computer system 100 incorporates various data buses 118 that are intended to allow for communication of the various components of computer system 100. Data buses 118 include, for example, input/output buses and bus controllers.

Indeed, the present invention is not limited to computer system 100 as known at the time of the invention. Instead, the present invention is intended to be deployed in future computer systems with more advanced technology that can make use of all aspects of the present invention. It is expected that computer technology will continue to advance but one of ordinary skill in the art will be able to take the present disclosure and implement the described teachings on the more advanced computers or other digital devices such as mobile telephones or “smart” televisions as they become available.

Moreover, the present invention may be implemented on one or more distributed computers. Still further, the present invention may be implemented in various types of software languages including C, C++, and others. Also, one of ordinary skill in the art is familiar with compiling software source code into executable software that may be stored in various forms and in various media (e.g., magnetic, optical, solid state, etc.). One of ordinary skill in the art is familiar with the use of computers and software languages and, with an understanding of the present disclosure, will be able to implement the present teachings for use on a wide variety of computers.

The present disclosure provides a detailed explanation of the present invention that allows one of ordinary skill in the art to implement the present invention as a computerized method. Certain details are not included in the present disclosure so as not to detract from the teachings presented herein, but it is understood that one of ordinary skill in the art would be familiar with such details.

In an embodiment of the invention as shown in FIG. 1B, a computer server that implements certain of the methods of the invention is remotely situated from a user. Computer server 122 is communicatively coupled so as to receive information from a user; likewise, computer server 122 is communicatively coupled so as to send information to a user. In an embodiment of the invention, the user uses user computing device 124 so as to access computer server 122 via network 126. Network 126 can be the internet, a local network, a private network, a public network, or any other appropriate network as may be appropriate to implement the invention as described herein.

User computing device 124 can be implemented in various forms such as desktop computer 128, laptop computer 130, smart phone 132, or tablet device 134. Other devices that may be developed and are capable of the computing actions described herein are also appropriate for use in conjunction with the present invention.

In the present disclosure, computing and other activities will be described as being conducted on either computer server 122 or user computing device 124. It should be understood, however, that many if not all of such activities may be reassigned from one device to the other while keeping within the present teachings. For example, for certain steps or computations that may be described as being performed on computer server 122, a different embodiment may have such computations performed on user computing device 124.

In an embodiment of the invention, computer server 122 is implemented as a web server on which Apache HTTP web server software is run. Computer server 122 can also be implemented in other manners such as an Oracle web server (known as Oracle iPlanet Web Server). In an embodiment computer server 122 is a UNIX-based machine but can also be implemented in other forms such as a Windows-based machine. Configured as a web server, computer server 122 is configured to serve web pages over network 126 such as the internet.

In an embodiment, user computing device 124 is configured so as to run web browser software. For example, where user computing device 124 is implemented as desktop computer 128 or laptop computer 130, currently available web browser software includes Internet Explorer, Firefox, and Chrome. Other browser software is available for different applications of user computing device 124. Still other software is expected to be developed in the future that is able to execute certain steps of the present invention.

In an embodiment, user computing device 124, through the use of appropriate software, queries computer server 122. Responsive to such query, computer server 122 provides information so as to display certain graphics and text on user computing device 124. In an embodiment, the information provided by computer server 122 is in the form of HTML that can be interpreted by and properly displayed on user computing device 124. Computer server 122 may provide other information that can be interpreted on user computing device 124.

It has been found that transcription factor binding is significant in the analysis of regulatory variation. For example, as shown in FIG. 4, whereas individual genomes vary very little in sequence (at approximately 1/1300 bp, and at approximately 1/1000 bp in promoter regions), transcription factor binding is much more variable. As shown in FIG. 4, promoter sequences 406, coding region nucleotides 404, and amino acid sequences 408 vary very little. Note, however, that transcription factor binding 402 has been found to vary in approximately 7.5% of NFkB binding sites. For purposes of personal genome sequencing, it has been found that 7.5% of SNPs are in transcription factor binding sites and 21% of SNPs are in DNase I HS sites. Embodiments of the present invention make use of these facts for purposes of improved functional and clinical analysis. For example, a single SNP is associated with NFkB binding as shown in FIG. 5.

In a certain sense, it is important to consider how binding of one factor can affect the binding of another. For example, as shown in FIG. 5, Stat1 motif 502 affects Stat1 504 binding. But it is also important to note that Stat1 504 binding affects NFkB 508 binding, for example, despite the conservation of the NFkB motif 510. More particularly, for example, it has been found that EBF1 motif affects NFkB binding as shown in FIG. 6.

A systematic approach is presented herein to combine disease association, transcription factor binding, and gene expression data to assess the functional consequences of variants associated with hundreds of human diseases. In an analysis of genome-wide binding profiles of NFκB, it was found that disease-associated SNPs are enriched in NFκB binding regions overall, and specifically for inflammatory-mediated diseases, such as asthma, rheumatoid arthritis, and coronary artery disease. Using genome-wide binding variation information for eight fully sequenced individuals, it was found that regions of NFκB binding correlated with disease-associated variants in an allele-specific manner (see pipeline method of FIG. 3 to be discussed below). Furthermore, it was found that this binding variation is often correlated with expression of nearby genes, which are also found to have altered expression in independent profiling of the variant-associated disease condition. With this systematic approach, the loop is closed on association studies that are otherwise free of biological context, and putative function is assigned to many disease-associated SNPs. In other embodiments of the invention, these predictions can be validated for atherosclerosis, asthma, and/or rheumatoid arthritis. It should, therefore, be noted that although certain particular embodiments will be discussed herein, such descriptions are illustrative and do not limit the scope of the present invention.

The association between genotype and phenotype is a fundamental problem in biology and translational medicine. Genome-wide association studies (GWASs) have identified many genetic variants associated with diseases [see Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106, 9362-9367 (2009); note that these and other references cited herein are incorporated by reference for all purposes], but such approaches rely on “tag” single nucleotide polymorphisms (SNPs) found on DNA microarrays. While these SNPs may lie in or near gene regions, their specific influences on the biology of disease are not necessarily determined in typical GWASs [Green, E. D., Guyer, M. S. National Human Genome Research Institute Charting a course for genomic medicine from base pairs to bedside. Nature 470, 204-213 (2011)]. Furthermore, disease-associated SNPs that are found outside of genic regions are often not further investigated because they are of unknown function.

Systems biology can provide an approach to bridge the gap between genotype and phenotype. For example, human variation in transcription factor (TF) binding has been correlated with polymorphisms in motifs for NFκB and PolII in ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010); Karczewski, K. J. et al. Discovering Cooperative Transcription Factor Associations using Binding Variation Information and the ALPHABIT Pipeline. 1-25 (2011)] and regulatory features across dozens of cell lines have been mapped extensively by the ENCODE project [Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799-816 (2007)].

It is, therefore, expected that polymorphisms that affect transcription factor binding can have a significant influence on disease because the differences in TF binding (that lead to downstream differences in expression) may be the true underlying cause of the disease association of the SNPs. These functional biology-rich sources of data can, therefore, be leveraged to suggest putative function for previously unannotated disease-associated SNPs.

In the present disclosure, the role of transcription factor binding sites in disease is described. As a non-limiting case study, genome-wide enrichments are explored for disease SNPs in NFκB (p65) binding regions to predict genotype-specific binding events associated with disease.

Shown in FIG. 2 is a block diagram of a method according to an embodiment of the present invention. As shown in step 202, information is received regarding genome-wide quantitative and genetic profiles of transcription factor binding measured across a multitude of individual human genomes. At step 204, information is received regarding genetic variants associated with disease conditions. Information regarding DNA motifs associated with transcription factor binding is received at step 206. At step 208, molecular profiles of disease pathologies are received. In an embodiment of the present invention, the information of steps 202 through 208 is contained in databases. In an embodiment, certain of the information of steps 202 through 208 is automatically generated or manually curated as further disclosed in copending application Ser. No. ______, entitled “Method and System for the Use of Biomarkers for Regulatory Dysfunction in Disease,” which is herein incorporated by reference for all purposes. In an embodiment of the invention, the databases are maintained locally on a computer system on which the method of the present invention is processed. In another embodiment of the invention, the databases are maintained remotely from a computer on which certain steps of the present invention are performed.

As shown in step 210 of FIG. 2, the regulatory impact of specific genetic variants is assessed using the received information. Examples of this analysis are provided below; however, those of ordinary skill in the art will understand that many other types of analyses are possible without deviating from the teachings of the present invention. Further processing can be performed such as shown in step 212 where the impact of genetic variants on biological function is assessed. Also, an embodiment of the present invention further performs step 214 where the impact of genetic variants on disease pathology is assessed. One of ordinary skill in the art will understand, however, that many other variations of the present invention are possible.
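By way of a non-limiting illustration, the flow of FIG. 2 may be sketched in Python as follows. This is a minimal sketch only: the class names, field names, and the `assess` function are hypothetical and are not part of the disclosure, the data layouts are assumptions, and only a simple form of step 210 is modeled (steps 212 and 214 are not).

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    """Hypothetical container for the information received at steps 202-208."""
    binding_regions: list   # step 202: (chrom, start, end), half-open intervals
    disease_variants: dict  # step 204: variant id -> set of disease names
    motif_sites: list       # step 206: (chrom, start, end, factor name)
    disease_profiles: dict  # step 208: disease -> set of gene names (unused here)


@dataclass
class Variant:
    id: str
    chrom: str
    pos: int


def assess(variant, inputs):
    """Simplified step 210: annotate one variant against the received data."""
    in_region = any(
        chrom == variant.chrom and start <= variant.pos < end
        for chrom, start, end in inputs.binding_regions
    )
    motifs = sorted({
        factor
        for chrom, start, end, factor in inputs.motif_sites
        if chrom == variant.chrom and start <= variant.pos < end
    })
    diseases = sorted(inputs.disease_variants.get(variant.id, ()))
    return {"in_binding_region": in_region, "motifs_hit": motifs, "diseases": diseases}


# Toy data with made-up identifiers and coordinates.
inputs = Inputs(
    binding_regions=[("chr1", 100, 200)],
    disease_variants={"rs_example": {"asthma"}},
    motif_sites=[("chr1", 140, 150, "NFKB")],
    disease_profiles={},
)
report = assess(Variant("rs_example", "chr1", 145), inputs)
print(report)
```

A linear scan is used here for clarity; an embodiment handling genome-scale data would more likely index the regions.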

NFκB Binding Regions are Enriched for Disease Associated SNPs

In a particular embodiment of the present invention, a compendium of disease SNPs [see Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)] was intersected with a set of 15,522 NFκB binding regions found in lymphoblastoid cell lines from ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. It was found that established disease-associated SNPs were overabundant in regions bound by NFκB (χ²=292.9; p=1.1e-65; Fisher's OR=2.95). These associations are not biased by the platforms used for disease association discovery, as NFκB regions are underrepresented on Affymetrix 6.0 and 500K arrays (Fisher's OR=0.8 and 0.82, respectively) and only slightly overrepresented on Illumina 550K and 1M (Fisher's OR= and 1.42, respectively), which represented a smaller portion of this analysis. Additionally, binding sites of a known interacting factor, Stat1, were also highly enriched for disease-SNPs; this enrichment was not present in promoter regions, as defined by PolII binding as shown in FIG. 8.
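By way of a non-limiting illustration, the intersection of SNP positions with binding regions and the resulting 2x2 table may be sketched in Python as follows. The function names and toy coordinates are hypothetical. The sketch assumes a single chromosome, sorted non-overlapping regions, and that every disease SNP is also present in the background set. The value returned is the simple cross-product odds ratio, which is not necessarily the estimate reported above.

```python
from bisect import bisect_right


def count_in_regions(positions, regions):
    """Count positions inside any half-open [start, end) region.

    Assumes `regions` is sorted by start and non-overlapping, on one chromosome.
    """
    starts = [start for start, _ in regions]
    hits = 0
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            hits += 1
    return hits


def contingency_table(disease_snps, all_snps, regions):
    """Return (a, b, c, d): disease/bound, disease/unbound, other/bound, other/unbound.

    Assumes every disease SNP also appears in `all_snps`.
    """
    a = count_in_regions(disease_snps, regions)
    b = len(disease_snps) - a
    bound_total = count_in_regions(all_snps, regions)
    c = bound_total - a
    d = len(all_snps) - bound_total - b
    return a, b, c, d


def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio; raises ZeroDivisionError if b or c is 0."""
    return (a * d) / (b * c)


# Toy data (hypothetical coordinates, not from the study).
regions = [(100, 200), (500, 600)]
all_snps = [50, 120, 150, 300, 550, 700, 800, 900]
disease_snps = [120, 150, 700]
table = contingency_table(disease_snps, all_snps, regions)
print(table, sample_odds_ratio(*table))
```

An odds ratio above 1 indicates that disease SNPs fall in binding regions more often than background SNPs do.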

As shown in FIG. 9, SNPs associated with inflammatory and autoimmune diseases, including Rheumatoid Arthritis, Asthma, and Systemic Lupus Erythematosus, were highly overrepresented in NFκB binding regions. These SNPs were enriched compared to all SNPs as well as the subset of disease-associated SNPs.

Disease-associated SNPs in NFκB binding regions are more pleiotropic (i.e., typically associated with more diseases) than the collection of known disease-associated SNPs (1.33 vs. 1.15; t-test p-value=8.7e-4; Mann Whitney U-test p-value=2.7e-7). For example, as shown in FIG. 7, the average disease-associated SNP was found to be associated with 1.15 diseases, whereas the average disease-associated SNP in NFkB binding regions was associated with 1.33 diseases. In FIG. 7, note that areas 702 correspond to NFkB SNPs and areas 704 correspond to all disease SNPs.
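By way of a non-limiting illustration, the pleiotropy comparison may be sketched in Python as follows: the number of distinct diseases per SNP is counted, and a Mann-Whitney U statistic is computed between two groups of counts. The function names and toy identifiers are hypothetical, and the sketch stops at the U statistic rather than computing a p-value.

```python
from collections import defaultdict


def diseases_per_snp(associations):
    """Map each SNP id to the number of distinct diseases associated with it."""
    by_snp = defaultdict(set)
    for snp, disease in associations:
        by_snp[snp].add(disease)
    return {snp: len(diseases) for snp, diseases in by_snp.items()}


def mean(values):
    return sum(values) / len(values)


def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y, using midranks for ties."""
    pooled = sorted(list(x) + list(y))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of 1-based ranks i+1..j
        i = j
    rank_sum_x = sum(ranks[v] for v in x)
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Toy associations with made-up SNP identifiers.
associations = [
    ("rsA", "asthma"), ("rsA", "rheumatoid arthritis"),
    ("rsB", "asthma"),
    ("rsC", "lupus"), ("rsC", "lupus"),  # duplicate reports count once
]
counts = diseases_per_snp(associations)
in_binding_regions = [counts["rsA"]]
elsewhere = [counts["rsB"], counts["rsC"]]
print(mean(in_binding_regions), mean(elsewhere))
print(mann_whitney_u(in_binding_regions, elsewhere))
```

U ranges from 0 to len(x) * len(y); values near the upper end indicate that the first group tends to have larger counts.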

Disease Associated SNPs are Found in More Biologically Relevant Binding Regions

NFκB binding regions that harbor disease-associated SNPs are more strongly bound by NFκB, as determined by ChIP-Seq binding intensity, compared to the background of all NFκB binding regions. Additionally, these binding regions are less variable, indicating the potential for evolutionary constraint on these regions.

SNPs in NFκB Binding Regions Suggest a Mechanism for the Biology of Disease

In a systematic effort to assign functional annotation to disease-associated SNPs, a pipeline as shown in FIG. 3 was developed to discover putative SNPs that may be associated with an effect on NFκB binding. As shown in FIG. 3, at step 302 a compendium of 24K disease-associated SNPs was received from a database. At step 304, 15,522 genome-wide transcription factor binding profiles were received. At step 306, the NFκB binding profiles for eight individuals were considered. Genome-wide expression profiling was then performed (e.g., RNA-Seq). Finally, at step 310, the human genomic disease expression information was collected in a database.

Using genotype and NFκB binding information from eight individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)], a preliminary, lower-power analysis was performed to identify candidate SNPs. In an assessment of SNPs in NFκB binding regions in linkage disequilibrium (R²>0.5) with a disease-associated SNP, SNPs associated with NFκB binding were found by an ANOVA. For instance, rs6135095, a SNP previously reported to be associated with atherosclerosis, shows significant association between genotype and NFκB binding in the eight cell lines queried.
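By way of a non-limiting illustration, a one-way ANOVA of binding signal across genotype groups may be sketched in Python as follows. The function names, genotype labels, and binding values are hypothetical toy inputs, not data from the study. The sketch returns the F statistic and its degrees of freedom; converting F to a p-value requires the F distribution, which a statistics package would supply.

```python
def group_by_genotype(genotypes, binding):
    """Group per-individual binding values by genotype label."""
    groups = {}
    for genotype, value in zip(genotypes, binding):
        groups.setdefault(genotype, []).append(value)
    return groups


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of samples.

    Needs at least two non-empty groups and some within-group variance.
    """
    groups = [g for g in groups if g]
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy example: eight individuals, three genotype classes (made-up values).
genotypes = ["CC", "CC", "CC", "CT", "CT", "CT", "TT", "TT"]
binding = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
groups = group_by_genotype(genotypes, binding)
f_stat, df_b, df_w = anova_f(list(groups.values()))
print(f_stat, df_b, df_w)
```

With only eight individuals the within-group degrees of freedom are small, which is consistent with the text describing the analysis as preliminary and lower-power.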

These variants associated with NFκB binding were linked with downstream expression effects of nearby genes. Considering all genes within 200 kb to be potential targets, disease-associated SNPs were found to be associated with changes in NFκB binding which were correlated with expression of nearby genes.
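By way of a non-limiting illustration, selecting candidate target genes within 200 kb of a SNP and correlating binding with expression may be sketched in Python as follows. The gene names, coordinates, and function names are hypothetical. Measuring distance to the nearest edge of the gene span is an assumption, since the text does not say whether distance is taken to the gene body or to the transcription start site.

```python
from math import sqrt

WINDOW = 200_000  # 200 kb, as stated in the text


def nearby_genes(snp_pos, genes, window=WINDOW):
    """Names of genes whose span lies within `window` bp of the SNP.

    `genes` maps name -> (start, end) on the same chromosome as the SNP.
    """
    selected = []
    for name, (start, end) in genes.items():
        if start <= snp_pos <= end:
            distance = 0
        else:
            distance = min(abs(snp_pos - start), abs(snp_pos - end))
        if distance <= window:
            selected.append(name)
    return sorted(selected)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data with made-up gene names and coordinates.
genes = {
    "GENE_A": (1_000_000, 1_050_000),
    "GENE_B": (1_400_000, 1_450_000),
    "GENE_C": (900_000, 950_000),
}
targets = nearby_genes(1_150_000, genes)
binding = [1.0, 2.0, 3.0]     # per-individual binding signal
expression = [2.0, 4.0, 6.0]  # per-individual expression of one target gene
print(targets, pearson(binding, expression))
```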

In an independent validation experiment, aortic tissues from 10 individuals were genotyped and certain of them were found with rs6135095 CT and TT. This SNP was found to be associated with binding of NFκB (by ChIP-qPCR) as well as expression of nearby genes (SIRPG, etc.).

ENCODE/HS Sites

PolII binding regions were not overrepresented for disease SNPs. Therefore, an enrichment for disease-associated SNPs was computed for various factors in several cell lines and it was found that these SNPs were overrepresented in certain of those factors.

Data Sources

Data on disease-SNP associations (p<0.01) were used as in [Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)]. ChIP-Seq data on eight cell lines with individual genome sequences were obtained from [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. All analyses were performed using dbSNP release 132 and hg19 coordinates.

ASB/ASE

ChIP-Seq reads were mapped to the hg19 assembly of the human genome using BWA. PCR duplicates were filtered using Picard tools. Variant calling files were downloaded from 1000 Genomes and converted to hg19 coordinates with VCF tools. Allele-specific binding (ASB) was determined on a per-heterozygote per-individual basis for the ten individuals. Reads were filtered to be above MAQ 30 mapping quality. For each individual, a binomial probability of success was determined based on the probability that a reference allele maps to the genome compared to a non-reference allele. Allele-specific expression (ASE) was similarly determined using reads from the transcriptome of each individual.
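By way of a non-limiting illustration, an allele-specific binding test at a heterozygous site may be sketched in Python as follows. The function names and read counts are hypothetical. Estimating the per-individual probability of success as the pooled fraction of reference reads over that individual's heterozygous sites is an assumption made for this sketch; the text does not specify how that probability was derived.

```python
from math import comb


def reference_fraction(sites):
    """Pooled fraction of reference reads over (ref_reads, alt_reads) pairs.

    Assumed here as the per-individual null probability of a reference read.
    """
    ref = sum(r for r, _ in sites)
    total = sum(r + a for r, a in sites)
    return ref / total


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no likelier than k."""
    observed = binom_pmf(k, n, p)
    tail = sum(
        pmf
        for pmf in (binom_pmf(i, n, p) for i in range(n + 1))
        if pmf <= observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, tail)


# Toy read counts at heterozygous sites of one individual (made-up numbers).
sites = [(6, 4), (5, 5)]
p_ref = reference_fraction(sites)

# Test one site with 10 reference reads and 0 alternate reads against p = 0.5.
p_value = binom_test_two_sided(10, 10, 0.5)
print(p_ref, p_value)
```

At realistic read depths the direct power computation can underflow, so a production implementation would more likely work in log space or call a statistics library.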

Statistical Analysis

Overall associations between NFκB binding regions and disease-associated SNPs were ascertained by chi-squared and Fisher's exact tests. Associations between individual SNPs and binding strengths were tested by two sample t-tests (with two genotypes grouped) or ANOVA for all 3 genotypes. All statistical analysis methods were performed using R statistical software (2.12.1).
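By way of a non-limiting illustration, the chi-squared statistic and a Fisher's exact test for a 2x2 table may be sketched in Python as follows. The function names and counts are hypothetical. The chi-squared statistic is computed without continuity correction and the Fisher test shown is one-sided (enrichment); R's `chisq.test` and `fisher.test` defaults differ on both points, so this sketch is not expected to reproduce R output exactly.

```python
from math import comb


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment in the top-left cell.

    Sums hypergeometric probabilities of tables with the same margins whose
    top-left count is at least `a`.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favorable / comb(n, col1)


# Toy table (made-up counts): rows = disease SNP or not, columns = bound or not.
a, b, c, d = 3, 1, 1, 3
chi2 = chi_squared_2x2(a, b, c, d)
p_enriched = fisher_exact_greater(a, b, c, d)
print(chi2, p_enriched)
```

Exact integer arithmetic via `math.comb` keeps the Fisher computation stable even for large counts, at the cost of speed.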

Using embodiments according to the present invention, it is possible to determine the previously unknown functional significance of regulatory variants. Indeed, embodiments of the present invention can be used to discover new transcription factor interactions. For example, using the present invention, disease-associated variants can be connected to molecular pathophysiology, and the function of non-coding SNPs can be explained.

It should be appreciated by those skilled in the art that the specific embodiments disclosed above may be readily utilized as a basis for modifying or designing other techniques for carrying out the same purposes of the present invention. It should also be appreciated by those skilled in the art that such modifications do not depart from the scope of the invention as set forth in the appended claims. 

We claim:
 1. A computer-implemented method for analyzing genetic variants, comprising: receiving information regarding genome-wide quantitative and genetic profiles; receiving information regarding genetic variants associated with disease conditions; and assessing the regulatory impact of specific genetic variants using the received information.
 2. The method of claim 1, wherein the genome wide information includes measures of transcription factor binding across a multitude of individual human genomes.
 3. The method of claim 1, further comprising receiving information regarding DNA motifs associated with transcription factor binding.
 4. The method of claim 1, further comprising receiving information regarding molecular profiles of disease pathologies.
 5. The method of claim 1, further comprising assessing the impact of genetic variants on biological function.
 6. The method of claim 1, further comprising assessing the impact of genetic variants on disease impact.
 7. The method of claim 1, wherein the genome wide information includes transcription factor binding profiles.
 8. The method of claim 7, further comprising receiving NFkB binding profiles for a first group of individuals.
 9. The method of claim 1, wherein assessing the regulatory impact includes performing a genome-wide expression profiling.
 10. The method of claim 9, wherein the genome-wide expression profiling includes an RNA sequencing analysis.
 11. The method of claim 1, further comprising assessing the impact of genetic variants on biological function.
 12. The method of claim 1, further comprising assessing the impact of genetic variants on disease pathology.
 13. A computer-readable medium including instructions that, when executed by a processing unit, cause the processing unit to implement a method for analyzing genetic variants, by performing the steps of: receiving information regarding genome-wide quantitative and genetic profiles; receiving information regarding genetic variants associated with disease conditions; and assessing the regulatory impact of specific genetic variants using the received information.
 14. The computer-readable medium of claim 13, wherein the genome wide information includes measures of transcription factor binding across a multitude of individual human genomes.
 15. The computer-readable medium of claim 13, further comprising receiving information regarding DNA motifs associated with transcription factor binding.
 16. The computer-readable medium of claim 13, further comprising receiving information regarding molecular profiles of disease pathologies.
 17. The computer-readable medium of claim 13, further comprising assessing the impact of genetic variants on biological function.
 18. The computer-readable medium of claim 13, further comprising assessing the impact of genetic variants on disease impact.
 19. The computer-readable medium of claim 13, wherein the genome wide information includes transcription factor binding profiles.
 20. The computer-readable medium of claim 19, further comprising receiving NFkB binding profiles for a first group of individuals.
 21. The computer-readable medium of claim 13, wherein assessing the regulatory impact includes performing a genome-wide expression profiling.
 22. The computer-readable medium of claim 21, wherein the genome-wide expression profiling includes an RNA sequencing analysis.
 23. The computer-readable medium of claim 13, further comprising assessing the impact of genetic variants on biological function.
 24. The computer-readable medium of claim 13, further comprising assessing the impact of genetic variants on disease pathology.
 25. A computing device comprising: a data bus; a memory unit coupled to the data bus; at least one processing unit coupled to the data bus and configured to receive information regarding genome-wide quantitative and genetic profiles, receive information regarding genetic variants associated with disease conditions, and assess the regulatory impact of specific genetic variants using the received information.